We regret to point out several inaccurate and misleading
statements that Benedetto et al. make in their Reply [1] to
our Comment [2] on their paper titled "Language Trees and
Zipping" [3].

First, they confusingly state in paragraph 7 that the Russian
and Greek alphabets are not phonetic, placing Russian and Greek
in the same category as Chinese, which uses a hieroglyphic
script. Second, they use unfair and irrelevant experiments to
convince the reader that the gzip-based approach is better than
the Markov-chain-based approach. Third, the figures reported
for the Newsgroups corpus seem to have been obtained on a
small, randomly selected subset of that corpus, which probably
makes them meaningless for the topic under discussion. Fourth,
their reference to the classification performance of the RAR
compressor, offered to refute our Comment, is irrelevant both
to our Comment and to their Letter [3]. And fifth, the authors
of [1] evidently have some difficulties with scientific
English. We elaborate on each of these points in the
subsequent paragraphs.

It is a well-established fact that Russian, like Greek, has a
phonetic alphabet. Perhaps Benedetto et al. [1] meant to use
transliteration for the construction of the Language Tree (LT).
However, this procedure has drawbacks such as non-uniqueness,
non-reversibility, and inexactness of the transformation. Most
importantly, it requires some knowledge about the language,
which shows that the requirement for a priori information,
pointed out in [2], remains valid, contrary to the claim in [3].

We believe that if one wants to compare the performance of
several classification methods, the comparison should be
performed within the same experimental framework. To start
with, let us denote by M, G, g the classification performance
of the following methods, respectively, on the corpus discussed
in [4]: the Markov chains approach [?]; attribution with a
single source using gzip [5]; and attribution with multiple
sources using gzip [3]. Let us denote by M', G', g' the
classification performance of these methods in the framework
of [3], and, finally, let M'', G'', g'' denote the
classification performance of the same methods on the
Newsgroups dataset. Notice that [1] presents only the values
G'' = 60% < g'' = 85% and G' = 77% < g' = 93%. One cannot draw
any conclusion about M' or M'' from these data, so our
statement about the superiority of the Markov chains approach
over the gzip approach (with either single- or multiple-source
files) remains valid. Moreover, we stated in our Comment [2]
that M = 69/82 ≈ 84% is greater than G = 50/82 ≈ 61%. We also
reported to the editors of Phys. Rev. Lett., in our answer to
the referee report of Benedetto et al., that g = 53/82 ≈ 65%,
which can indeed be considered an argument for our claim that
Markov chains are generally more attractive than the
gzip-based approach.
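
For concreteness, the two gzip settings can be sketched in a few
lines of Python. This is only an illustration under our own
assumptions, not the code used in any of the cited experiments:
we read "single source" as one concatenated reference file per
class and "multiple source" as several separate reference files
per class, we use zlib's DEFLATE as a stand-in for the gzip
program, and all function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the DEFLATE-compressed data (stand-in for gzip)."""
    return len(zlib.compress(data, 9))


def delta(reference: str, unknown: str) -> int:
    """Extra compressed bytes needed when `unknown` is appended to `reference`."""
    ref = reference.encode("utf-8")
    return compressed_size(ref + unknown.encode("utf-8")) - compressed_size(ref)


def classify_single_source(sources: dict, unknown: str) -> str:
    """Assumed 'single source' setting: one concatenated file per class."""
    return min(sources, key=lambda label: delta(" ".join(sources[label]), unknown))


def classify_multiple_source(sources: dict, unknown: str) -> str:
    """Assumed 'multiple source' setting: every reference file is scored
    separately and the class of the best-scoring file wins."""
    return min(
        sources,
        key=lambda label: min(delta(text, unknown) for text in sources[label]),
    )
```

Both functions take a mapping from class label to a list of
reference texts. The multiple-source variant makes one
compression call per reference file rather than one per class,
which is what makes it expensive on large corpora.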

In our opinion, the "slightly different method" of [1] should
be regarded as an approach to the design of the experiment, one
that leads to extremely slow classification, especially when
there are thousands of documents to classify: with a thousand
source documents, a really large classification experiment
becomes prohibitive. This raises the question of the validity
of the figures G'' = 60% and g'' = 85% reported in [1].
Traditionally, the precision of a classification method on the
Newsgroups corpus is measured as follows: one performs a random
10-fold or 5-fold split and reports the average results of
cross-validation. Typical numbers reported are around 80% [?],
with 82.1% for the PPM (Markov-based) approach. It would be
interesting to know the technique used by [1], since even a
5-fold split validation by their method would require about
5 × (18828/5) × (4 × 18828/5) ≈ 284 × 10^6 calls of the gzip
compression program, which is prohibitive on conventional
computers. If one wants to apply a complete cross-validation as
suggested in [?], then one has to make even more,
18828^2 ≈ 354 × 10^6, calls of gzip. We suspect that the
figures G'' = 60% and g'' = 85% reported in [1] were obtained
on a small, randomly selected subset of the Newsgroups corpus
and are subject to substantial random variation; hence they
should not be used for a quantitative comparison of the
single- and multiple-source file settings.
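
The call counts quoted above follow from simple arithmetic. A
minimal Python sketch, assuming that comparing one test document
with one training document costs exactly one gzip call (the
function names are ours and purely illustrative):

```python
def gzip_calls_kfold(n_docs: int, k: int) -> float:
    """Calls needed for k-fold cross-validation when every test
    document is compared with every training document."""
    test_size = n_docs / k             # documents held out in each fold
    train_size = (k - 1) * n_docs / k  # documents used as sources in each fold
    return k * test_size * train_size


def gzip_calls_complete(n_docs: int) -> int:
    """Calls for complete cross-validation, using the n^2 estimate from
    the text (excluding self-comparisons would give n * (n - 1))."""
    return n_docs ** 2


N_NEWSGROUPS = 18828  # corpus size used in the text

print(round(gzip_calls_kfold(N_NEWSGROUPS, 5) / 1e6))   # millions of calls, 5-fold
print(round(gzip_calls_complete(N_NEWSGROUPS) / 1e6))   # millions of calls, complete
```

The 5-fold count equals (4/5) × 18828^2, so the two estimates
differ only by the factor 4/5.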

Finally, in our Comment we stated that the Markov chain
approach as reported in [?] is superior to the LZ approach used
in [3]. This statement was misinterpreted in [1] as a general
statement that the LZ approach is outperformed by the simple
Markov chain approach, and Benedetto et al. [1] easily refute
the misinterpreted statement using our own result on RAR [5].
The correct generalization (and the only possible understanding
in view of the references given) of our statement is: for any
modification of the LZ compression scheme there exists a
modification of the Markov chain approach (the PPM compression
scheme) that outperforms LZ in classification. (This statement
is similar to a postulate well known among specialists: any
modification of the LZ compression scheme can be outperformed
by a properly modified PPM compression scheme.) The RAR
algorithm, the know-how of its creator Eugene Roshal, is highly
sophisticated and goes far beyond a naive use of Ziv-Lempel
theory; it should be compared with, for example, the
state-of-the-art PPMd (PPMonstr) algorithm developed recently
by Dmitry Shkarin. We find it extremely interesting and
scientifically valuable that the crude first-order Markov chain
produces results competitive with highly sophisticated
algorithms. As for the polemical comparison between the Markov
chain and the RAR compressor by Benedetto et al. [1], we find
it irrelevant in the framework of their paper [3]. Indeed, if
Benedetto et al. [1] insist on technical details, such as
multiple- and single-source classification, they should
restrict their method to the application of gzip only, which is
the main technical detail of their Letter [3].
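
To make concrete what a first-order Markov chain classifier
involves, here is a minimal Python sketch. It is an illustration
under our own assumptions rather than the exact procedure of the
cited work: transitions are counted between characters, add-one
smoothing over an assumed alphabet of 256 symbols handles unseen
transitions, and the function names are hypothetical.

```python
import math
from collections import Counter


def train(text: str):
    """Count character-to-character transitions in a training text."""
    pair_counts = Counter(zip(text, text[1:]))  # (previous, current) pairs
    prev_counts = Counter(text[:-1])            # how often each char is a predecessor
    return pair_counts, prev_counts


def cross_entropy(model, text: str, alphabet_size: int = 256) -> float:
    """Average bits per transition of `text` under a first-order model.
    Add-one smoothing and alphabet_size=256 are illustrative assumptions."""
    pair_counts, prev_counts = model
    total_bits = 0.0
    for prev, cur in zip(text, text[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + alphabet_size)
        total_bits -= math.log2(p)
    return total_bits / max(len(text) - 1, 1)


def classify(models: dict, unknown: str) -> str:
    """Assign `unknown` to the class whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: cross_entropy(models[label], unknown))
```

Each class needs a single pass over its training text, and
classifying a document costs one pass over that document per
class, independent of the number of training documents.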

As a final remark, we would like to point out that the Reply
[1] exhibits some language mistakes by its own authors. Indeed,
they refer to our Comment [2] using the expression "Khmelev et
al.", as if [2] had at least three co-authors (the common
meaning of et al. is "and others").

To sum up, one cannot draw any conclusion about the comparison
between the nth-order Markov chain approach (by which we meant
the PPM approach as well) and the gzip-based approach from the
statements given in [1]. We also believe that the authors of
[1] were not aware of our reported figure of g = 64%; otherwise
it looks very strange that they did not mention this argument
in their Reply. We also suggest that the authors of [1] present
a fair comparison of their method against others, [?], and,
e.g., the SVM approach [?].

P.S. This story shows that the editors of a physics journal
like Phys. Rev. Lett. perhaps should avoid publishing papers
like [3], because Phys. Rev. Lett. referees do not have enough
experience to assess the scientific value of, and identify
mistakes in, non-physics papers. We also encourage physicists
and mathematicians to send their non-physics and
non-mathematics papers to appropriate scientific journals, even
if these are not as well known as Phys. Rev. Lett. Such a
publication would probably not yield much publicity, but the
quality and scientific value of the paper would increase
significantly.

The example of [3] is not unique. A similar story, which
reappears from time to time in newspapers, is the story of
computing with DNA, described in detail in [?]. It is possible
to do computations with DNA. However, the amount of DNA
required to solve, say, the travelling salesman problem with
100 cities is comparable to the mass of the Earth, which makes
its use impractical and impossible. Notice that computer
science methods allow practical travelling salesman problems to
be solved in reasonable time for numbers of cities on the order
of 10^6 on contemporary computers. However, the authors of the
speculative DNA computation approach speculate that the
efficiency issue will be solved in the future, a strange
analogy with the suggestion of [?].

Notice also that the publication of non-physics papers in a
physics journal is evidence of a crisis in physics, of which
responsible physicists should be aware. Otherwise, why are
physicists publishing speculative papers on non-physics
subjects? Is this not evidence that they cannot find an
application for their abilities in physics?